Done by Veronika Karpushenkova¶

worked on server /home/Veronika.Karpushenko/final_project/rnaseq

to save some space I will turn down warning notifications in cell output:

In [1]:
options(warn = -1)

Task:

The RNA-seq data will be used to find a correlation between the level of gene expression and methylation patterns. We will use DESeq2 to analyze differential expression, basic method of normalisation will be used (median-of-ratios). But before that, it will be necessary to make an annotation. Since we have salmon files ready, it will be necessary to match the transcript with the corresponding gene. For clusterization and further visualisation we will do regularized log normalisation to minimise the noise and stabilize dispersion.

Uploading packages¶

In [2]:
library(dplyr)
library(ggplot2)
library(DESeq2)
library(DOSE)
library(pheatmap)
library(tibble)
library(clusterProfiler)
library(DT)
library(rtracklayer)
library(tximport)
library(org.Hs.eg.db)
library(EnhancedVolcano)
library(KEGG.db)
library(msigdbr)
library(cluster)
library(factoextra)
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: S4Vectors

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which.max, which.min



Attaching package: ‘S4Vectors’


The following objects are masked from ‘package:dplyr’:

    first, rename


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges


Attaching package: ‘IRanges’


The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice


Loading required package: GenomicRanges

Loading required package: GenomeInfoDb

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘matrixStats’


The following object is masked from ‘package:dplyr’:

    count



Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians




DOSE v3.24.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use DOSE in published research, please cite:
Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609


clusterProfiler v4.6.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141


Attaching package: ‘clusterProfiler’


The following object is masked from ‘package:IRanges’:

    slice


The following object is masked from ‘package:S4Vectors’:

    rename


The following object is masked from ‘package:stats’:

    filter


Loading required package: AnnotationDbi


Attaching package: ‘AnnotationDbi’


The following object is masked from ‘package:clusterProfiler’:

    select


The following object is masked from ‘package:dplyr’:

    select




Loading required package: ggrepel


KEGG.db contains mappings based on older data because the original
  resource was removed from the the public domain before the most
  recent update was produced. This package should now be considered
  deprecated and future versions of Bioconductor may not have it
  available.  Users who want more current data are encouraged to look
  at the KEGGREST or reactome.db packages


Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Uploading files¶

We have RNA-seq files (salmon, transcriptome - gencode v46 or Ensembl v114)
iPSC-derived: GSE212252 (3 replicates)
Post-mortem: GSE96615 (3 replicates)

The files:
CULTURE_SRR21307327_Zaghi.sf CULTURE_SRR21307381_Zaghi.sf PM_BA9_SRR5343882_Rizzardi.sf PM_BA9_SRR5343889_Rizzardi.sf PM_BA9_SRR5343896_Rizzardi.sf CULTURE_SRR21307379_Zaghi.sf PM_BA9_SRR5343878_Rizzardi.sf PM_BA9_SRR5343886_Rizzardi.sf PM_BA9_SRR5343892_Rizzardi.sf

In [3]:
#PM corresponds to post-mortem, and CULTURE corresponds to iPSC-derived
files <- c(
  "data/CULTURE_SRR21307327_Zaghi.sf",
  "data/CULTURE_SRR21307381_Zaghi.sf",
  "data/CULTURE_SRR21307379_Zaghi.sf",
  "data/PM_BA9_SRR5343878_Rizzardi.sf",
  "data/PM_BA9_SRR5343882_Rizzardi.sf",
  "data/PM_BA9_SRR5343886_Rizzardi.sf",
  "data/PM_BA9_SRR5343889_Rizzardi.sf",
  "data/PM_BA9_SRR5343892_Rizzardi.sf",
  "data/PM_BA9_SRR5343896_Rizzardi.sf"
)
sample_names <- c(
  "CULTURE_327", "CULTURE_381", "CULTURE_379",
  "PM_878", "PM_882", "PM_886", "PM_889", "PM_892", "PM_896"
)

# create a sample table
sample_table <- data.frame(
  sample = sample_names,
  file = files
)
In [4]:
names(files) <- sample_names
In [5]:
head(sample_table)
A data.frame: 6 × 2
samplefile
<chr><chr>
1CULTURE_327data/CULTURE_SRR21307327_Zaghi.sf
2CULTURE_381data/CULTURE_SRR21307381_Zaghi.sf
3CULTURE_379data/CULTURE_SRR21307379_Zaghi.sf
4PM_878 data/PM_BA9_SRR5343878_Rizzardi.sf
5PM_882 data/PM_BA9_SRR5343882_Rizzardi.sf
6PM_886 data/PM_BA9_SRR5343886_Rizzardi.sf
In [6]:
# read one .sf file to get transcript IDs
library(readr)
sf_example <- read_tsv(files[1])
head(sf_example)
Rows: 192825 Columns: 5
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Name
dbl (4): Length, EffectiveLength, TPM, NumReads

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 5
NameLengthEffectiveLengthTPMNumReads
<chr><dbl><dbl><dbl><dbl>
ENST00000415118.1 8 900
ENST00000448914.1131400
ENST00000434970.2 91000
ENST00000631435.1121300
ENST00000710614.1161700
ENST00000605284.1171800

Accordong to the IDs above our genome transcriptome is Ensembl v114

I donloaded Homo_sapiens.GRCh38.113.gtf.gz:
wget https://ftp.ensembl.org/pub/release-114/gtf/homo_sapiens/Homo_sapiens.GRCh38.114.gtf.gz

In [7]:
gtf_file <- "Homo_sapiens.GRCh38.114.gtf.gz"

gtf <- import(gtf_file)

# extract transcript-to-gene mapping
tx2gene <- unique(as.data.frame(mcols(gtf[gtf$type == "transcript", c("transcript_id", "gene_id")])))

# remove version numbers to match .sf IDs
tx2gene$transcript_id <- sub("\\..*", "", tx2gene$transcript_id)

# save mapping
write.csv(tx2gene, "tx2gene.csv", row.names = FALSE)
In [8]:
tx2gene <- read.csv("tx2gene.csv")

We will make a list containing matrices for counts, abundances, and lengths at the gene level, which can be used for downstream differential expression analysis:

In [9]:
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files with read_tsv

1 
2 
3 
4 
5 
6 
7 
8 
9 


transcripts missing from tx2gene: 14374

summarizing abundance

summarizing counts

summarizing length

For ChIP-seq analysis let's extract TPM-matrix:

In [10]:
# extract gene-level TPM matrix
gene_tpm <- txi$abundance
In [11]:
head(gene_tpm)
A matrix: 6 × 9 of type dbl
CULTURE_327CULTURE_381CULTURE_379PM_878PM_882PM_886PM_889PM_892PM_896
ENSG0000000000322.448834 22.270822 23.500443 0.876926 0.342681 1.648065 2.578503 3.453720 3.912682
ENSG00000000005 2.020044 1.779226 1.509333 0.163611 0.734623 0.000000 0.000000 0.164376 0.361587
ENSG0000000041999.721981100.960604104.68648614.57062024.31748116.78095434.03586927.39610732.453499
ENSG0000000045723.385931 23.811759 20.751628 5.942035 7.128343 2.891963 8.859937 3.19839111.518047
ENSG0000000046016.858568 16.662214 17.772674 5.360479 7.358298 3.259216 5.815689 3.429402 5.388200
ENSG00000000938 1.124916 0.674152 0.348630 0.169220 0.000000 0.096412 0.000000 0.000000 0.000000
In [12]:
culture_samples <- c("CULTURE_327", "CULTURE_381", "CULTURE_379")
pm_samples <- c("PM_878", "PM_882", "PM_886", "PM_889", "PM_892", "PM_896")
In [13]:
genes_bed <- read.table("genes.bed", header=FALSE, stringsAsFactors=FALSE)
In [14]:
avg_culture_tpm <- rowMeans(gene_tpm[, culture_samples], na.rm = TRUE)
avg_pm_tpm <- rowMeans(gene_tpm[, pm_samples], na.rm = TRUE)
In [15]:
tpm_df <- data.frame(
  gene_id = rownames(gene_tpm),
  avg_culture_tpm = avg_culture_tpm,
  avg_pm_tpm = avg_pm_tpm,
  stringsAsFactors = FALSE
)
In [16]:
genes_bed$gene_id <- gsub(" ;", "", genes_bed$V4)
In [17]:
bed_tpm <- merge(genes_bed, tpm_df, by.x = "gene_id", by.y = "gene_id")
In [18]:
write_bed_filtered <- function(df, tpm_col, lower, upper, filename) {
  subset_df <- df[df[[tpm_col]] >= lower & df[[tpm_col]] < upper, ]
  write.table(subset_df[, 2:7], file = filename, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}
In [19]:
# Culture TPM < 1
write_bed_filtered(bed_tpm, "avg_culture_tpm", -Inf, 1, "culture_tpm_less_1.bed")

# Culture 1 ≤ TPM < 10
write_bed_filtered(bed_tpm, "avg_culture_tpm", 1, 10, "culture_tpm_1_to_10.bed")

# Culture TPM ≥ 10
write_bed_filtered(bed_tpm, "avg_culture_tpm", 10, Inf, "culture_tpm_greater_10.bed")

# PM TPM < 1
write_bed_filtered(bed_tpm, "avg_pm_tpm", -Inf, 1, "pm_tpm_less_1.bed")

# PM 1 ≤ TPM < 10
write_bed_filtered(bed_tpm, "avg_pm_tpm", 1, 10, "pm_tpm_1_to_10.bed")

# PM TPM ≥ 10
write_bed_filtered(bed_tpm, "avg_pm_tpm", 10, Inf, "pm_tpm_greater_10.bed")

Peak annotation and differential peak analysis of ChIP-seq

In [20]:
library(GenomicRanges)
library(rtracklayer)
In [21]:
colnames(txi$counts)
  1. 'CULTURE_327'
  2. 'CULTURE_381'
  3. 'CULTURE_379'
  4. 'PM_878'
  5. 'PM_882'
  6. 'PM_886'
  7. 'PM_889'
  8. 'PM_892'
  9. 'PM_896'

We need a meta table which maps our samples to the corresponding sample groups that we are investigating. Our metadata will include three columns: “Genotype”, “sample” and “SampleID”.

In [22]:
meta <- data.frame(sample = sample_names) %>%
  mutate(
    Genotype = sapply(strsplit(sample, "_"), `[`, 1),
    SampleID = sapply(strsplit(sample, "_"), `[`, 2)
  )
knitr::kable(meta)

|sample      |Genotype |SampleID |
|:-----------|:--------|:--------|
|CULTURE_327 |CULTURE  |327      |
|CULTURE_381 |CULTURE  |381      |
|CULTURE_379 |CULTURE  |379      |
|PM_878      |PM       |878      |
|PM_882      |PM       |882      |
|PM_886      |PM       |886      |
|PM_889      |PM       |889      |
|PM_892      |PM       |892      |
|PM_896      |PM       |896      |

DESeq2 object¶

In [23]:
rownames(meta) <- sample_names
In [24]:
meta
A data.frame: 9 × 3
sampleGenotypeSampleID
<chr><chr><chr>
CULTURE_327CULTURE_327CULTURE327
CULTURE_381CULTURE_381CULTURE381
CULTURE_379CULTURE_379CULTURE379
PM_878PM_878 PM 878
PM_882PM_882 PM 882
PM_886PM_886 PM 886
PM_889PM_889 PM 889
PM_892PM_892 PM 892
PM_896PM_896 PM 896

To create DESeq2 object we will need the count matrix and the metadata table as input. We will also need to specify a design formula. The design formula specifies the column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is ~ Genotype.

In [25]:
dds <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ Genotype)

# filter out genes with low counts
keep <- rowSums(counts(dds)) > 10
dds <- dds[keep, ]
dds <- DESeq(dds)
using counts and average transcript lengths from tximport

estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

In [26]:
dds
class: DESeqDataSet 
dim: 23483 9 
metadata(1): version
assays(6): counts avgTxLength ... H cooks
rownames(23483): ENSG00000000003 ENSG00000000005 ... ENSG00000310560
  ENSG00000310576
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(9): CULTURE_327 CULTURE_381 ... PM_892 PM_896
colData names(3): sample Genotype SampleID

Normalization¶

Now we will normalize the count data in order to be able to make fair gene comparisons between samples. We will ise the median of ratios method of normalization

In [27]:
sf <- estimateSizeFactorsForMatrix(txi$counts)
sizeFactors(dds) <- sf
In [28]:
sizeFactors(dds)
CULTURE_327
1.0223145187594
CULTURE_381
0.67191292956111
CULTURE_379
0.990176479011959
PM_878
1.04430860800395
PM_882
1.50298011628634
PM_886
0.852391424845682
PM_889
1.14089423739088
PM_892
0.806440209103821
PM_896
1.36534534933666
In [29]:
normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
A matrix: 6 × 9 of type dbl
CULTURE_327CULTURE_381CULTURE_379PM_878PM_882PM_886PM_889PM_892PM_896
ENSG000000000031315.12181311.709361387.006203105.474607 27.12929249.587051172.3374341.573744242.691535
ENSG00000000005 30.4428 26.95968 22.924633 5.052421 15.03105 0.000000 0.0000 4.177684 5.764415
ENSG000000004191981.36142015.838902096.573597592.896670656.13703861.614995771.3531918.146060682.117333
ENSG000000004571411.15841444.242911262.574474734.895553584.22999451.403830610.2279325.478831735.425811
ENSG00000000460 824.2443 818.94785 875.480495536.184248487.95587411.587940324.3556282.767203278.231930
ENSG00000000938 27.4213 16.52287 8.564975 8.452427 0.00000 6.075086 0.0000 0.000000 0.000000

Quality control¶

We perform QC checks on the count data to help us ensure that the samples/replicates look good.

First, let’s try the most simple method of log2 transformation and plot counts of the two CULTURE replicates as a scatter plot:

In [30]:
# extract normalized counts
nts <- counts(dds, normalized = TRUE)

# log2 transform
nts_log2 <- log2(nts + 1)

# convert to data frame
df <- as.data.frame(nts_log2)

# check your column names
print(colnames(df))
[1] "CULTURE_327" "CULTURE_381" "CULTURE_379" "PM_878"      "PM_882"     
[6] "PM_886"      "PM_889"      "PM_892"      "PM_896"     
In [31]:
# plot
ggplot(df, aes(x = CULTURE_327, y = CULTURE_381)) + 
    geom_point() +
    ggtitle('log2(x+1) transformation') +
    theme_bw(base_size = 16)
No description has been provided for this image

Above we can see heteroscedasticity in the data, illustrating the dependence of the variance on the mean count value.

Let's apply a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering:

In [32]:
rld <- rlog(dds, blind=TRUE)
ggplot(as.data.frame(assay(rld)), aes(x = CULTURE_327, y = CULTURE_381)) + 
    geom_point() +
    ggtitle('rlog transformation')
No description has been provided for this image
In [33]:
# Extract the rlog-transformed counts matrix
rlog_mat <- assay(rld)
In [34]:
# convert to data frame and add gene IDs as a column
rlog_df <- as.data.frame(rlog_mat)
rlog_df$gene_id <- rownames(rlog_df)

# Write to CSV (gene IDs as first column)
write.csv(rlog_df, file = "rlog_gene_expression_matrix.csv", row.names = FALSE)

Principal Component Analysis (PCA)¶

In [35]:
plotPCA.mystyle <- function (object, meta, ntop = 500, returnData = FALSE)
{
    font.size <- 18
    rv <- rowVars(assay(object))
    select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    pca <- prcomp(t(assay(object)[select, ]))
    percentVar <- pca$sdev^2/sum(pca$sdev^2)
    
    d1 <- data.frame(
        PC1 = pca$x[, 1], 
        PC2 = pca$x[, 2], 
        Genotype = meta$Genotype,
        name = rownames(meta)
    )
    
    ggplot(data = d1, aes(x = PC1, y = PC2)) +
        geom_point(aes(color = Genotype), size = 6) + 
        xlab(paste0("PC1: ", round(percentVar[1] * 100), "% variance")) + 
        ylab(paste0("PC2: ", round(percentVar[2] * 100), "% variance")) +
        theme_dose(font.size = font.size) +
        theme(
            legend.key = element_rect(colour = NA, fill = NA), 
            legend.title= element_blank(), 
            legend.text=element_text(size=font.size-2)
        )
}
In [36]:
plotPCA.mystyle(rld, meta)
No description has been provided for this image
In [37]:
plotPCA.mystyle <- function (object, meta, ntop = 500, returnData = FALSE)
{
    font.size <- 18
    rv <- rowVars(assay(object))
    select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    pca <- prcomp(t(assay(object)[select, ]))
    percentVar <- pca$sdev^2/sum(pca$sdev^2)
    
    d1 <- data.frame(
        PC1 = pca$x[, 1], 
        PC3 = pca$x[, 3], 
        Genotype = meta$Genotype,
        name = rownames(meta)
    )
    
    ggplot(data = d1, aes(x = PC1, y = PC3)) +
        geom_point(aes(color = Genotype), size = 6) + 
        xlab(paste0("PC1: ", round(percentVar[1] * 100), "% variance")) + 
        ylab(paste0("PC3: ", round(percentVar[3] * 100), "% variance")) +
        theme_dose(font.size = font.size) +
        theme(
            legend.key = element_rect(colour = NA, fill = NA), 
            legend.title= element_blank(), 
            legend.text=element_text(size=font.size-2)
        )
}
In [38]:
plotPCA.mystyle(rld, meta)
No description has been provided for this image

Those above are PCA plots which might show and emphasize variations and strong patterns. We can see that clustering is quite obvious, and we can see that biological replicates cluster together on PC1 (82% of variance explained), and the samples from different treatment groups cluster separately.The percantage of variance corresponds to the amount of variation captured by the principal component. Post-mortem samples have bigger variance on PC2 and PC3, than culture samples which is due to biological heterogeneity (time after death, tissue degradation, and individual biological variability), and technical heterogeneity (degradation, differences in sample handling, or batch effects) of post-mortem samples.

Correlation Heatmap¶

In [39]:
rld_mat <- assay(rld)
In [40]:
rld_cor <- cor(rld_mat)
In [41]:
pheatmap(rld_cor, annotation = meta[,c(1,2)])
No description has been provided for this image

As expected, correlation between culture samples is really high. PM samples also correlate really well, and bigger correlation is seen between replicates.

Differential testing¶

We have one factor (Genotype) but let's copy it into distant column

In [42]:
meta$Factors <- meta$Genotype 
meta$Factors <- factor(meta$Factors, levels = unique(meta$Factors))
In [43]:
dds_factors <- DESeqDataSetFromTximport(txi, colData = meta, design = ~ Factors)
using counts and average transcript lengths from tximport

DE analysis¶

In [44]:
dds_analysis <- DESeq(dds_factors)
estimating size factors

using 'avgTxLength' from assays(dds), correcting for library size

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

Let's plot dispersion estimates:

In [45]:
plotDispEsts(dds_analysis)
No description has been provided for this image

To tell DESeq2 which groups we wish to compare, we supply the contrasts we would like to make using the contrast argument.

In [46]:
contrast_pm_vs_culture <- c("Factors", "PM", "CULTURE")
In [47]:
res_unshrunken_pm_vs_culture <- results(dds_analysis, contrast = contrast_pm_vs_culture, alpha = 0.05)

To generate more accurate log2 fold change estimates, we will shrink the log2 fold changes estimates toward zero when the information for a gene is low.

In [48]:
res_pm_vs_culture <- lfcShrink(dds_analysis, 
                               contrast = contrast_pm_vs_culture, 
                               res = res_unshrunken_pm_vs_culture, 
                               type = "normal")
using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).

Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
Reference: https://doi.org/10.1093/bioinformatics/bty895

In [49]:
knitr::kable(head(res_pm_vs_culture))

|                |    baseMean| log2FoldChange|     lfcSE|      stat|    pvalue|      padj|
|:---------------|-----------:|--------------:|---------:|---------:|---------:|---------:|
|ENSG00000000003 |  572.514554|      -2.757133| 0.5821881| -4.735112| 0.0000022| 0.0000094|
|ENSG00000000005 |   12.261410|      -2.201704| 1.0369156| -2.120160| 0.0339925| 0.0590397|
|ENSG00000000419 | 1175.115454|      -1.438254| 0.2348224| -6.124895| 0.0000000| 0.0000000|
|ENSG00000000457 |  839.959754|      -1.250333| 0.3197326| -3.910558| 0.0000921| 0.0002921|
|ENSG00000000460 |  537.750606|      -1.110226| 0.3106795| -3.573594| 0.0003521| 0.0009824|
|ENSG00000000938 |    7.448517|      -2.471943| 1.4383665| -1.709865| 0.0872908| 0.1334243|
In [50]:
plotMA(res_pm_vs_culture, ylim = c(-3, 5))
No description has been provided for this image

Let's also plot the unshrunken results.

In [51]:
plotMA(res_unshrunken_pm_vs_culture, ylim = c(-3, 5))
No description has been provided for this image

Here unshrunken data is little bit more dispersed, and noisy.

In [52]:
summary(res_pm_vs_culture, alpha = 0.05)
out of 28292 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 6372, 23%
LFC < 0 (down)     : 6621, 23%
outliers [1]       : 243, 0.86%
low counts [2]     : 4883, 17%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Extracting significant genes¶

Now we are ready to extract significant differentially expressed genes. We will define significant genes as those satisfying the |FDR p-value < 0.05| and |log2(Fold Change)| > 0.58:

In [53]:
padj.cutoff <- 0.05
lfc.cutoff <- 0.58

res_pm_vs_culture_tb <- res_pm_vs_culture %>%
  data.frame() %>%
  rownames_to_column(var = "gene") %>%
  as_tibble()

sig_pm_vs_culture <- res_pm_vs_culture_tb %>%
  dplyr::filter(padj < padj.cutoff & abs(log2FoldChange) >= lfc.cutoff)
In [54]:
knitr::kable(head(sig_pm_vs_culture))

|gene            |  baseMean| log2FoldChange|     lfcSE|      stat|    pvalue|      padj|
|:---------------|---------:|--------------:|---------:|---------:|---------:|---------:|
|ENSG00000000003 |  572.5146|     -2.7571332| 0.5821881| -4.735112| 0.0000022| 0.0000094|
|ENSG00000000419 | 1175.1155|     -1.4382535| 0.2348224| -6.124895| 0.0000000| 0.0000000|
|ENSG00000000457 |  839.9598|     -1.2503335| 0.3197326| -3.910558| 0.0000921| 0.0002921|
|ENSG00000000460 |  537.7506|     -1.1102258| 0.3106795| -3.573594| 0.0003521| 0.0009824|
|ENSG00000001036 |  536.6069|     -3.2073107| 0.5176828| -6.194693| 0.0000000| 0.0000000|
|ENSG00000001084 |  787.0533|     -0.7939098| 0.3461836| -2.293329| 0.0218291| 0.0400770|

Number of significant genes:

In [55]:
nrow(sig_pm_vs_culture)
12971
In [56]:
summary(res_pm_vs_culture, alpha = 0.05)
out of 28292 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 6372, 23%
LFC < 0 (down)     : 6621, 23%
outliers [1]       : 243, 0.86%
low counts [2]     : 4883, 17%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Gene annotation¶

To perform Gene ID conversion we will use human genome annotation from org.Hs.eg.db package. We use select() function to convert ENSEMBL IDs to gene names (“SYMBOL”) and Entrez identifiers (“ENTREZID”)

In [57]:
mm <- org.Hs.eg.db
my.symbols <- res_pm_vs_culture_tb$gene 
map <- select(mm,
              keys = my.symbols,
              columns = c("ENTREZID", "SYMBOL", "ENSEMBL"),
              keytype = "ENSEMBL")
'select()' returned 1:many mapping between keys and columns

First, we will combine DE result tables with the obtained gene annotation:

In [58]:
res_pm_vs_culture_tb <- left_join(res_pm_vs_culture_tb, map, by = c("gene" = "ENSEMBL"))
In [59]:
datatable(head(res_pm_vs_culture_tb))

Next, we get rid of duplicated ENSEMBL IDs:

In [60]:
res_pm_vs_culture_tb <- res_pm_vs_culture_tb[!duplicated(res_pm_vs_culture_tb$gene),]

Finally, we get the annotation only for the differentially expressed genes:

In [61]:
sig_pm_vs_culture <- res_pm_vs_culture_tb %>% filter(gene %in% sig_pm_vs_culture$gene)

Visualization: Volcano plot¶

This plot shows the direction of gene expression changes (up- or down-regulated) and also highlight the top DE genes.

In [62]:
p1 <- EnhancedVolcano(res_pm_vs_culture_tb,
        lab = res_pm_vs_culture_tb$SYMBOL, 
        x = 'log2FoldChange',
        y = 'padj', 
        title = 'PM vs CULTURE',
        subtitle = NULL,
        pCutoff = 0.05, 
        FCcutoff = 0.58)

p1
No description has been provided for this image

Each point on volcano plot is a DE gene.
On x-axis negative values correspond to genes more highly expressed in culture, and positive values correspond to genes more highly expressed in post-mortem samples.
On y-axis higher value corrsponds to more statistically significant result (lower p-value).
Red dots correspond to genes statistically significant (by p-value) and have a large fold change. Grey are not significant results, blues are significant only by p-value, and greens are significant only by fold change.
So, SOX11, FAM229B, VIM, DLK1, etc. are significant DE genes for culture samples.
Meanwhile, ZBTB16M TESPA1, GAS7, DPP6, GABRA4, etc. are significant DE genes for post-mortem samples.

In [142]:
n_deg <- length(unique(sig_pm_vs_culture$gene))
print(n_deg)
[1] 12971

And the amount of DE genes is 12971.

Heatmap of significant genes¶

We convert normalized_counts to a data frame and transfer the row names to a new column called “gene”

In [63]:
normalized_counts <- counts(dds_analysis, normalized = T) %>%
                data.frame() %>%
                rownames_to_column(var="gene")

We extract normalized expression for significant genes:

In [64]:
norm_sig_pm_vs_culture <- normalized_counts %>%
                filter(gene %in% sig_pm_vs_culture$gene) %>% column_to_rownames('gene')

We plot the heatmap

In [65]:
pheatmap(norm_sig_pm_vs_culture, 
                cluster_rows = T,
                show_rownames = F,
                annotation = meta[,c(1,2)],
                border_color = NA,
                fontsize = 10,
                scale = "row",
                fontsize_row = 10,
                height = 10)
No description has been provided for this image

Here also we can see quite high and clear clusterization between culture samples. Same for post-mortem samples with a little bit less correlation, and between replicates correlation is higher than between samples.

Comparison with differentially significant genes according to methylation¶

All of those are hypomethylated

diff_bed_with_distances_20_all_with_1cov.csv (CpG)
ENST00000510987.1 - CNOT4 (CCR4-NOT transcription complex subunit 4)
ENST00000303221.10 – EMB (embigin)
ENST00000814441.1 – lncRNA
ENST00000838704.1 – lncRNA
ENST00000715202.1 - FAM153B

Non-CpG
ENST00000835172.1 – lncRNA
ENST00000720066.1 – lncRNA
ENST00000829015.1 – lncRNA
ENST00000829014.1 – lncRNA
ENST00000829007.1 – lncRNA
ENST00000303221.10 – EMB
ENST00000659832.1 – lncRNA
ENST00000812067.1 - lncRNA

In [73]:
genes_ENSEMBL <- c('ENST00000510987', 'ENST00000303221', 'ENST00000814441', 
                   'ENST00000838704', 'ENST00000715202', 'ENST00000835172', 
                  'ENST00000720066', 'ENST00000829015', 'ENST00000829014',
                  'ENST00000829007', 'ENST00000303221', 'ENST00000659832',
                  'ENST00000812067')
genes_ENSEMBL
  1. 'ENST00000510987'
  2. 'ENST00000303221'
  3. 'ENST00000814441'
  4. 'ENST00000838704'
  5. 'ENST00000715202'
  6. 'ENST00000835172'
  7. 'ENST00000720066'
  8. 'ENST00000829015'
  9. 'ENST00000829014'
  10. 'ENST00000829007'
  11. 'ENST00000303221'
  12. 'ENST00000659832'
  13. 'ENST00000812067'
In [82]:
methyl <- sig_pm_vs_culture[sig_pm_vs_culture$gene %in% genes_ENSEMBL, ]
In [83]:
methyl
A tibble: 0 × 9
genebaseMeanlog2FoldChangelfcSEstatpvaluepadjENTREZIDSYMBOL
<chr><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>

None of differentially siginificant genes from methylation overlapped with differentially expressed genes from RNA-seq data.

Comparison with cortical layer markers¶

Let's analyze cortical layer marker expression in iPSC-derived neurons

According to:
https://www.pnas.org/doi/10.1073/pnas.1216793109
https://pubmed.ncbi.nlm.nih.gov/21228164/
Cortical layer markers: Tbr1, Ctip2, and Satb2 are expressed in postmitotic neurons, Fezf2 is expressed in cycling cortical progenitors from very early stages of corticogenesis. Fezf2 blocks corticothalamic fate in layer 5 by reducing Tbr1 expression in subcerebral neurons
According to:
https://www.pnas.org/doi/10.1073/pnas.1216793109
Tbr1, EphA4, and Unc5H3 are critical downstream targets of Satb2 in callosal fate specification. This represents a unique role for Tbr1, implicated previously in specifying corticothalamic projections. We further show that Tbr1 expression is dually regulated by Satb2 and Ctip2 in layers 2–5.
According to:
https://www.bio-connect.nl/news/cortical-layers/
LHX2 and PAX6 together play a crucial role in the specification of neo-cortical progenitors which give rise to the projection neurons. MEF2C is another example of transcription factor essential for normal neural development and spatial distribution in the neocortex.
Upper layers neurons can be identified by expression of CUX1 and POU3F2 (BRN2), the neurons of layer V – by expression of BCL11B (CTIP2) and neurons of layer VI – by FOXP2 expression

Summary:
-Deep layers (V–VI): Tbr1, Fezf2, Ctip2

-Upper layers (II–IV): Satb2, Cux1, Pou3f2 (Brn2)

-Pan-neuronal markers: Map2 (maturation), Syt1 (synaptic), Gria2 (glutamatergic)

In [84]:
genes <- c('TBR1', 'FEZF2', 'CTIP2', 'SATB2', 'CUX1', 'POU3F2', 'BRN2', 'MAP2', 'SYT1', 'HEXB', 'CST3', 'GRIA2')
genes_ENSEMBL <- filter(map, SYMBOL %in% genes)$ENSEMBL
genes_ENSEMBL
  1. 'ENSG00000049860'
  2. 'ENSG00000067715'
  3. 'ENSG00000078018'
  4. 'ENSG00000101439'
  5. 'ENSG00000119042'
  6. 'ENSG00000120251'
  7. 'ENSG00000136535'
  8. 'ENSG00000153266'
  9. 'ENSG00000184486'
  10. 'ENSG00000257923'
In [85]:
genes_filtered <- filter(map, ENSEMBL %in% genes_ENSEMBL)$SYMBOL
genes_filtered
  1. 'HEXB'
  2. 'SYT1'
  3. 'MAP2'
  4. 'CST3'
  5. 'SATB2'
  6. 'GRIA2'
  7. 'TBR1'
  8. 'FEZF2'
  9. 'POU3F2'
  10. 'CUX1'
In [86]:
datatable(head(res_pm_vs_culture_tb))
In [87]:
head(sig_pm_vs_culture)
A tibble: 6 × 9
genebaseMeanlog2FoldChangelfcSEstatpvaluepadjENTREZIDSYMBOL
<chr><dbl><dbl><dbl><dbl><dbl><dbl><chr><chr>
ENSG00000000003 572.5146-2.75713320.5821881-4.7351122.189342e-069.416691e-067105 TSPAN6
ENSG000000004191175.1155-1.43825350.2348224-6.1248959.074332e-106.737691e-098813 DPM1
ENSG00000000457 839.9598-1.25033350.3197326-3.9105589.208303e-052.920985e-0457147SCYL3
ENSG00000000460 537.7506-1.11022580.3106795-3.5735943.521147e-049.824268e-0455732C1orf112
ENSG00000001036 536.6069-3.20731070.5176828-6.1946935.839856e-104.463415e-092519 FUCA2
ENSG00000001084 787.0533-0.79390980.3461836-2.2933292.182905e-024.007702e-022729 GCLC
In [88]:
cortical <- sig_pm_vs_culture[sig_pm_vs_culture$gene %in% genes_ENSEMBL, ]
In [89]:
datatable(cortical)
In [90]:
p1 <- EnhancedVolcano(cortical,
        lab = cortical$SYMBOL, 
        x = 'log2FoldChange',
        y = 'padj', 
        title = 'Cortical layers markers',
        subtitle = NULL,
        pCutoff = 0.05, 
        FCcutoff = 0.58)
p1
No description has been provided for this image
In [91]:
norm_cortical <- normalized_counts %>%
                filter(gene %in% cortical$gene) %>% column_to_rownames('gene')
In [92]:
pheatmap(norm_cortical, 
                cluster_rows = T,
                show_rownames = F,
                annotation = meta[,c(1,2)],
                border_color = NA,
                fontsize = 10,
                scale = "row",
                fontsize_row = 10,
                height = 10)
No description has been provided for this image

On volcano plots: We can see that the bigger y-value is, the more probable that post-mortem result is different from culture. The bigger the x-value is, the bigger the contribution of this gene to expression. Let's remind as markers: -Deep layers (V–VI): Tbr1, Fezf2, Ctip2

-Upper layers (II–IV): Satb2, Cux1, Pou3f2 (Brn2)

-Pan-neuronal markers: Map2 (maturation), Syt1 (synaptic), Gria2 (glutamatergic)

Pou3f2 and Map2 are seen as significant in culture samples. These results correspond to upper layer and maturation.

Satb2, Tbr1, Fezf2, Gria2, and Syt1 are seen as significant in post-mortem samples. These results correspond to each of category, and probably it is because of a higher heterogenuity in post-mortem samples.

On a heatmap we can clearly see the difference in expression of genes between post-mortem and culture groups.

Gene Ontology¶

In [93]:
ego_pm_vs_culture <- enrichGO(gene = sig_pm_vs_culture$gene, 
                    universe = res_pm_vs_culture_tb$gene, 
                    keyType = "ENSEMBL",
                    OrgDb = org.Hs.eg.db, 
                    ont = "BP",
                    pAdjustMethod = "BH", 
                    pvalueCutoff = 0.05)

We use boxplots to visualize the enriched categories:

In [94]:
barplot(ego_pm_vs_culture)
No description has been provided for this image

Pathway analysis¶

In [95]:
.libPaths()
  1. '/home/Veronika.Karpushenko/R/x86_64-conda-linux-gnu-library/4.2'
  2. '/opt/anaconda3/envs/deseq/lib/R/library'
In [96]:
ekegg <- enrichKEGG(gene = na.omit(sig_pm_vs_culture$ENTREZID),
                    organism = 'hsa',
                    pvalueCutoff = 0.05,
                    universe = na.omit(res_pm_vs_culture_tb$ENTREZID), 
                    use_internal_data = T)
## Now the output of the enrichKEGG() is generated, but the Description column consist of NAs. 
## We will add missing pathway names manually from the KEGG.db we just uploaded.

ekegg@result$Description <- unlist(as.list("KEGG.db"::"KEGGPATHID2NAME")[ekegg@result$ID])
In [97]:
ekegg@result$Description <- sub(" - Homo sapiens \\(human\\)", "", ekegg@result$Description)
In [98]:
library(enrichplot)
In [99]:
ups <- enrichplot::upsetplot(ekegg)
In [100]:
ups
No description has been provided for this image

Gene-Set Enrichment Analysis (GSEA)¶

In [101]:
filtered <- filter(res_pm_vs_culture_tb, ENTREZID != "NA")
In [102]:
foldchanges <- filtered$log2FoldChange
names(foldchanges) <- as.character(filtered$ENTREZID)
In [103]:
foldchanges <- sort(foldchanges, decreasing = TRUE)
In [104]:
## KEEG.db is already loaded

gseaKEGG <- gseKEGG(geneList = foldchanges,              
                    pAdjustMethod = "fdr",
                    organism = "hsa", 
                    nPerm = 2000, 
                    minGSSize = 10, 
                    pvalueCutoff = 0.05, 
                    verbose = FALSE, 
                    seed = TRUE, 
                    use_internal_data = T)

gseaKEGG@result$Description <- unlist(as.list("KEGG.db"::"KEGGPATHID2NAME")[gseaKEGG@result$ID])
In [105]:
gseaKEGG@result$Description <- sub(" - Homo sapiens \\(human\\)", "", gseaKEGG@result$Description)
In [106]:
datatable(gseaKEGG@result)

Below is a GSEA plot for three manually selected enriched pathway:

In [107]:
enrichplot::gseaplot2(gseaKEGG, geneSetID = c("hsa05012", "hsa05131", "hsa04714"), pvalue_table = TRUE,
          color = c("#E495A5", "#86B875", "#7DB0DD"), ES_geom = "dot")
No description has been provided for this image

Gene Set Enrichment Analysis with mSigDB¶

In [108]:
msig_h <- msigdbr(species = "human", category = "H")
In [109]:
msigdbr_t2g <- msig_h %>%
  dplyr::distinct(gs_name, ensembl_gene) %>%
  as.data.frame()
In [110]:
datatable(head(msigdbr_t2g))
In [111]:
hallmarks <- enricher(gene = sig_pm_vs_culture$gene, TERM2GENE = msigdbr_t2g)

Let's visualize the obtained enriched terms

In [112]:
dotplot(hallmarks)
No description has been provided for this image
In [113]:
msigdbr_t2g <- msig_h %>%
  dplyr::distinct(gs_name, entrez_gene)
In [114]:
gsea_h <- GSEA(foldchanges, TERM2GENE = msigdbr_t2g)
preparing geneSet collections...

GSEA analysis...

leading edge analysis...

done...

In [115]:
datatable(head(gsea_h))

Perform GSEA on fold change lists using cortical layer marker genes as the gene set¶

In this task we expect you to prepare a term2gene data frame where the first column corresponds to unique term name ("cortical markers") and the second column consists of cortical markers entrez ids.

In [116]:
genes <- c('TBR1', 'FEZF2', 'CTIP2', 'SATB2', 'CUX1', 'POU3F2', 'BRN2', 'MAP2', 'SYT1', 'HEXB', 'CST3', 'GRIA2')
genes_ENTREZID <- filter(map, SYMBOL %in% genes)$ENTREZID
genes_ENTREZID
  1. '3074'
  2. '6857'
  3. '4133'
  4. '1471'
  5. '23314'
  6. '2891'
  7. '10716'
  8. '55079'
  9. '5454'
  10. '1523'
In [117]:
term2gene <- data.frame(
  term = "CorticalLayerMarkers",
  gene = genes_ENTREZID
)
In [118]:
name2gene <- data.frame(name = genes_filtered, entrez_gene=genes_ENTREZID)
name2gene
A data.frame: 10 × 2
nameentrez_gene
<chr><chr>
HEXB 3074
SYT1 6857
MAP2 4133
CST3 1471
SATB2 23314
GRIA2 2891
TBR1 10716
FEZF2 55079
POU3F25454
CUX1 1523
In [119]:
head(msigdbr_t2g)
dim(msigdbr_t2g)
A tibble: 6 × 2
gs_nameentrez_gene
<chr><int>
HALLMARK_ADIPOGENESIS 19
HALLMARK_ADIPOGENESIS11194
HALLMARK_ADIPOGENESIS10449
HALLMARK_ADIPOGENESIS 33
HALLMARK_ADIPOGENESIS 34
HALLMARK_ADIPOGENESIS 35
  1. 7321
  2. 2
In [120]:
gsea_layer <- GSEA(
  geneList = foldchanges,
  TERM2GENE = term2gene,
  minGSSize = 1,   # as low as possible for small gene sets
  pvalueCutoff = 0.5,  
  verbose = FALSE
)
In [121]:
gsea_layer@result
A data.frame: 1 × 11
IDDescriptionsetSizeenrichmentScoreNESpvaluep.adjustqvaluerankleading_edgecore_enrichment
<chr><chr><int><dbl><dbl><dbl><dbl><lgl><int><chr><chr>
CorticalLayerMarkersCorticalLayerMarkersCorticalLayerMarkers100.51128161.2439030.22037420.2203742NA1101tags=30%, list=5%, signal=29%10716/55079/23314

For cortical layer markers we don't see significant results with GSEA.

Let's perform clustering of DE genes to identify groups of genes with similar expression trends, and perform the enrichment analysis (enrichGO/enrichKEGG) on the groups of our genes.

Hierarchical clustering: https://www.geeksforgeeks.org/hierarchical-clustering-in-r-programming/

In [122]:
# finding distance matrix
distance_mat <- dist(norm_sig_pm_vs_culture, method = 'manhattan')
#distance_mat
 
# fitting Hierarchical clustering model 
# to training dataset
set.seed(240)  # Setting seed
Hierar_cl <- hclust(distance_mat, method = "ward.D2")
Hierar_cl
Call:
hclust(d = distance_mat, method = "ward.D2")

Cluster method   : ward.D2 
Distance         : manhattan 
Number of objects: 12971 
In [123]:
# plotting dendrogram
plot(Hierar_cl, hang = -1, labels = FALSE)

# choosing no. of clusters
# cutting tree by height
abline(h = 3, col = "green")

# cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 4 )
#fit
 
table(fit)
rect.hclust(Hierar_cl, k = 4, border = "green")
fit
    1     2     3     4 
12747   217     5     2 
No description has been provided for this image

So, probably we can see 3 distinct clusters at least. Let's check:

Code for k-means and silhouette: https://uc-r.github.io/kmeans_clustering

In [124]:
k2 <- kmeans(norm_sig_pm_vs_culture, centers = 2, nstart = 25)
str(k2)
List of 9
 $ cluster     : Named int [1:12971] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:12971] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
 $ centers     : num [1:2, 1:9] 1312 44370 1340 46373 1325 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:9] "CULTURE_327" "CULTURE_381" "CULTURE_379" "PM_878" ...
 $ totss       : num 4.19e+12
 $ withinss    : num [1:2] 1.22e+12 1.86e+12
 $ tot.withinss: num 3.08e+12
 $ betweenss   : num 1.11e+12
 $ size        : int [1:2] 12935 36
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
In [125]:
fviz_cluster(k2, data = norm_sig_pm_vs_culture, labelsize = 0) #I' ve hidden labels as they were interfering
No description has been provided for this image
In [126]:
k3 <- kmeans(norm_sig_pm_vs_culture, centers = 3, nstart = 25)
k4 <- kmeans(norm_sig_pm_vs_culture, centers = 4, nstart = 25)
k5 <- kmeans(norm_sig_pm_vs_culture, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = norm_sig_pm_vs_culture) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = norm_sig_pm_vs_culture) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = norm_sig_pm_vs_culture) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = norm_sig_pm_vs_culture) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
Attaching package: ‘gridExtra’


The following object is masked from ‘package:Biobase’:

    combine


The following object is masked from ‘package:BiocGenerics’:

    combine


The following object is masked from ‘package:dplyr’:

    combine


No description has been provided for this image
In [127]:
fviz_nbclust(norm_sig_pm_vs_culture, kmeans, method = "silhouette")
No description has been provided for this image

So, according to silhouette the most proabable number of clusters is 2.

enrichGO

In [128]:
changed <- tibble::rownames_to_column(norm_sig_pm_vs_culture, var = "gene")
In [129]:
head(changed)
A data.frame: 6 × 10
geneCULTURE_327CULTURE_381CULTURE_379PM_878PM_882PM_886PM_889PM_892PM_896
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1ENSG000000000031315.12181311.70941387.0062105.47461 27.12929249.5871172.3374341.57374242.6915
2ENSG000000004191981.36142015.83892096.5736592.89667656.13703861.6150771.3531918.14606682.1173
3ENSG000000004571411.15841444.24291262.5745734.89555584.22999451.4038610.2279325.47883735.4258
4ENSG00000000460 824.2443 818.9478 875.4805536.18425487.95587411.5879324.3556282.76720278.2319
5ENSG000000010361293.16861238.04881466.0557 44.55863122.39913101.1396170.5701 84.63007308.8916
6ENSG00000001084 974.64571377.7342 945.0904475.48029465.50208642.9037744.7011461.74628995.6760
In [130]:
ego_sig <- enrichGO(gene = changed$gene,
                                universe = res_pm_vs_culture_tb$gene, 
                                keyType = "ENSEMBL",
                                OrgDb = org.Hs.eg.db, 
                                ont = "BP",
                                pAdjustMethod = "BH", 
                                pvalueCutoff = 0.05)
In [131]:
barplot(ego_sig)
No description has been provided for this image

Okay, let's do the same for cortical layer markers

In [132]:
# finding distance matrix
distance_mat <- dist(norm_cortical, method = 'manhattan')
#distance_mat
 
# fitting hierarchical clustering model 
# to training dataset
set.seed(240)  # setting seed
Hierar_cl <- hclust(distance_mat, method = "ward.D2")
Hierar_cl
Call:
hclust(d = distance_mat, method = "ward.D2")

Cluster method   : ward.D2 
Distance         : manhattan 
Number of objects: 9 
In [133]:
# plotting dendrogram
plot(Hierar_cl)

# choosing no. of clusters
# cutting tree by height
abline(h = 3, col = "green")

# cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 4 )
#fit
 
table(fit)
rect.hclust(Hierar_cl, k = 4, border = "green")
fit
1 2 3 4 
1 1 6 1 
No description has been provided for this image
In [134]:
k2 <- kmeans(norm_cortical, centers = 2, nstart = 25)
str(k2)
List of 9
 $ cluster     : Named int [1:9] 1 2 1 1 2 1 1 1 1
  ..- attr(*, "names")= chr [1:9] "ENSG00000067715" "ENSG00000078018" "ENSG00000101439" "ENSG00000119042" ...
 $ centers     : num [1:2, 1:9] 3016 49957 3131 53411 3042 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:9] "CULTURE_327" "CULTURE_381" "CULTURE_379" "PM_878" ...
 $ totss       : num 4.54e+10
 $ withinss    : num [1:2] 6.54e+09 1.20e+10
 $ tot.withinss: num 1.85e+10
 $ betweenss   : num 2.68e+10
 $ size        : int [1:2] 7 2
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
In [135]:
fviz_cluster(k2, data = norm_cortical, labelsize = 0)
No description has been provided for this image
In [136]:
k3 <- kmeans(norm_cortical, centers = 3, nstart = 25)
k4 <- kmeans(norm_cortical, centers = 4, nstart = 25)
k5 <- kmeans(norm_cortical, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = norm_cortical) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = norm_cortical) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = norm_cortical) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = norm_cortical) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)
No description has been provided for this image
In [137]:
dim(norm_cortical)
  1. 9
  2. 9
In [138]:
fviz_nbclust(norm_cortical, kmeans, method = "silhouette", k.max = 8)
No description has been provided for this image

So for cortical layer markers it's also most probable 3 clusters

In [139]:
changed_cortical <- tibble::rownames_to_column(norm_cortical, var = "gene")
In [140]:
ego_cortical <- enrichGO(gene = changed_cortical$gene,
                                universe = res_pm_vs_culture_tb$gene, 
                                keyType = "ENSEMBL",
                                OrgDb = org.Hs.eg.db, 
                                ont = "BP",
                                pAdjustMethod = "BH", 
                                pvalueCutoff = 0.05)

The enriched categories for cortical layer markers:

In [141]:
barplot(ego_cortical)
No description has been provided for this image
In [ ]:

In [ ]:

In [ ]: